Starbucks Capstone Project

Introduction

This dataset simulates customer behavior on the Starbucks rewards mobile app. Once every few days, Starbucks sends an offer to app users. An offer can be merely an advertisement for a drink, a discount, or a BOGO (buy one get one free). Some customers may not receive any offer for weeks at a time.

Not every customer receives the same offers, and that is the challenge of this dataset.

The task is to combine transaction, demographic, and offer data to determine which demographic groups respond best to which offer type. This dataset is a simplified version of the real Starbucks app data: the underlying simulator generates only a single product, whereas Starbucks actually sells dozens of drinks.

Every offer has a validity period. For example, a BOGO offer might be valid for only 5 days. You will see in the dataset that even informational offers have a validity period, although they are merely advertisements for a drink: if an informational offer is valid for 7 days, you can assume the customer may be influenced by it throughout those 7 days.
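Since the transcript records time in hours, the validity window described above can be expressed in the same units. A minimal sketch (the function name and arguments are illustrative, not part of the dataset):

```python
def offer_window(received_at_hours, duration_days):
    """Return the (start, end) hours during which an offer is in effect.

    The transcript's `time` column is in hours, while portfolio's
    `duration` is in days, so the duration is converted to hours.
    """
    start = received_at_hours
    end = received_at_hours + duration_days * 24
    return start, end

# A BOGO offer received at hour 168 (day 7) with a 5-day duration
start, end = offer_window(168, 5)  # -> (168, 288)
```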

The dataset also contains transaction records for purchases made through the app, including the purchase time and the amount paid. The transcript also records the type and number of offers each customer received and the time each offer was viewed, and a record is logged whenever a customer makes a purchase.

Also keep in mind that a customer may make purchases without having received or viewed any offer.

Example

For example, a customer receives a "$2 off when you spend $10" offer on Monday, valid for 10 days from receipt. If the customer's cumulative spend within the validity period reaches $10, the customer completes the offer.

However, there are a few things to watch out for in this dataset. Offers take effect automatically: once received, an offer's discount applies whenever its conditions are met, even if the customer never saw it. For example, a customer receives the "$2 off $10" offer but never opens it during the 10-day validity window, yet spends a cumulative $15 within those 10 days. The data will record the offer as completed, but this customer was not actually influenced by the offer, because they never knew it existed.
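One way to encode the attribution rule described above is to require that an offer was viewed before it was completed, inside the validity window. A minimal sketch, assuming times in hours as in transcript.json (the function and its arguments are illustrative):

```python
def was_influenced(received_at, viewed_at, completed_at, duration_days):
    """A completion counts as influenced only if the customer viewed the
    offer before completing it, and both events fall inside the validity
    window. viewed_at may be None if the offer was never opened."""
    end = received_at + duration_days * 24  # window end, in hours
    if viewed_at is None or completed_at is None:
        return False
    return received_at <= viewed_at <= completed_at <= end

# Completes a 10-day offer without ever viewing it: not influenced
print(was_influenced(0, None, 120, 10))  # False
# Viewed at hour 24, completed at hour 120: influenced
print(was_influenced(0, 24, 120, 10))    # True
```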

Cleaning

Cleaning this data is important and tricky.

Consider that some groups will make purchases even if they never receive an offer. From a business perspective, if a customer is going to spend $10 regardless of any offer, we would rather not send them a "$2 off $10" offer. So it may be worth analyzing what each group buys in the absence of any offer.
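As a sketch of that idea: on a hypothetical toy transcript (column names follow transcript.json; the data here is made up), spend by customers who never viewed any offer can serve as a rough no-offer baseline:

```python
import pandas as pd

# Hypothetical toy transcript: 'a' and 'c' viewed an offer, 'b' never did
transcript_toy = pd.DataFrame({
    'person': ['a', 'a', 'b', 'b', 'c'],
    'event':  ['offer viewed', 'transaction', 'transaction', 'transaction', 'offer viewed'],
    'value':  [{}, {'amount': 10.0}, {'amount': 5.0}, {'amount': 7.0}, {}],
})

# customers who viewed at least one offer
viewers = set(transcript_toy.loc[transcript_toy['event'] == 'offer viewed', 'person'])

# spend by customers who never viewed an offer - a rough "no-offer" baseline
tx = transcript_toy[transcript_toy['event'] == 'transaction'].copy()
tx['amount'] = tx['value'].apply(lambda v: v.get('amount', 0.0))
baseline = tx.loc[~tx['person'].isin(viewers), 'amount'].sum()
print(baseline)  # 12.0
```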

Datasets

The data is contained in three files:

  • portfolio.json – the offer ids and metadata about each offer (duration, type, etc.)
  • profile.json – demographic data for each customer
  • transcript.json – records for transactions, offers received, offers viewed, and offers completed

Here is the schema and explanation of each variable in the files:

portfolio.json

  • id (string) – offer id
  • offer_type (string) – the type of offer, i.e. BOGO, discount, informational
  • difficulty (int) – the minimum required spend to complete an offer
  • reward (int) – the reward given for completing an offer
  • duration (int) – the time the offer is open, in days
  • channels (list of strings) – the delivery channels, e.g. web, email, mobile, social

profile.json

  • age (int) – age of the customer
  • became_member_on (int) – the date the customer first created an app account
  • gender (str) – gender of the customer (note that some entries contain 'O' for other, rather than 'M' for male or 'F' for female)
  • id (str) – customer id
  • income (float) – customer's income

transcript.json

  • event (str) – record description (i.e. transaction, offer received, offer viewed)
  • person (str) – customer id
  • time (int) – time in hours since the start of the test; the data begins at time t=0
  • value (dict of strings) – either an offer id or the transaction amount, depending on the record

Starbucks Project

This notebook works with simulated data that mimics customer behavior on the Starbucks rewards mobile app.

Table of Contents

I. Exploratory Data Analysis
II. Statistical Analysis of Customer Behavior
III. Machine Learning Modeling
IV. Conclusions and Summary

In [2]:
import pandas as pd
import numpy as np
import math
import json
import plotly.express as px
import scipy.stats as stats
from datetime import datetime

%matplotlib inline
In [11]:
import warnings
warnings.filterwarnings('ignore')
In [ ]:
# read in the json files
portfolio = pd.read_json('data/portfolio.json', orient='records', lines=True)
profile = pd.read_json('data/profile.json', orient='records', lines=True)
transcript = pd.read_json('data/transcript.json', orient='records', lines=True)

Part I: Exploratory Data Analysis

Use the data dictionary above and the cells below to get a feel for the descriptive statistics of the data.

In [3]:
# view the offers for the 30-day test period (10 offers x 6 fields)

portfolio
Out[3]:
channels difficulty duration id offer_type reward
0 [email, mobile, social] 10 7 ae264e3637204a6fb9bb56bc8210ddfd bogo 10
1 [web, email, mobile, social] 10 5 4d5c57ea9a6940dd891ad53e9dbe8da0 bogo 10
2 [web, email, mobile] 0 4 3f207df678b143eea3cee63160fa8bed informational 0
3 [web, email, mobile] 5 7 9b98b8c7a33c4b65b9aebfe6a799e6d9 bogo 5
4 [web, email] 20 10 0b1e1539f2cc45b7b9fa7c272da2e1d7 discount 5
5 [web, email, mobile, social] 7 7 2298d6c36e964ae4a3e7e9706d1fb8c2 discount 3
6 [web, email, mobile, social] 10 10 fafdcd668e3743c1bb461111dcafc2a4 discount 2
7 [email, mobile, social] 0 3 5a8bc65990b245e5a138643cd4eb9837 informational 0
8 [web, email, mobile, social] 5 5 f19421c1d4aa40978ebb69ca19b0e20d bogo 5
9 [web, email, mobile] 10 7 2906b810c7d4411798c6938adc9daaa5 discount 2
In [4]:
# check the shapes of the 3 dataframes

portfolio.shape, profile.shape, transcript.shape
Out[4]:
((10, 6), (17000, 5), (306534, 4))
In [5]:
# check null values

portfolio.isnull().sum(), profile.isnull().sum(), transcript.isnull().sum()
Out[5]:
(channels      0
 difficulty    0
 duration      0
 id            0
 offer_type    0
 reward        0
 dtype: int64, age                    0
 became_member_on       0
 gender              2175
 id                     0
 income              2175
 dtype: int64, event     0
 person    0
 time      0
 value     0
 dtype: int64)
In [6]:
# transcript data - event log (306534 events x 4 fields)

transcript.head()
Out[6]:
event person time value
0 offer received 78afa995795e4d85b5d9ceeca43f5fef 0 {'offer id': '9b98b8c7a33c4b65b9aebfe6a799e6d9'}
1 offer received a03223e636434f42ac4c3df47e8bac43 0 {'offer id': '0b1e1539f2cc45b7b9fa7c272da2e1d7'}
2 offer received e2127556f4f64592b11af22de27a7932 0 {'offer id': '2906b810c7d4411798c6938adc9daaa5'}
3 offer received 8ec6ce2a7e7949b1bf142def7d0e0586 0 {'offer id': 'fafdcd668e3743c1bb461111dcafc2a4'}
4 offer received 68617ca6246f4fbc85e91a2a49552598 0 {'offer id': '4d5c57ea9a6940dd891ad53e9dbe8da0'}
In [7]:
# view customer demographic data - dataframe of rewards users (17000 users x 5 fields)

print(profile.head())
print(profile[profile['gender'].isna()].shape) # rows where gender is null
print(profile[profile['age'] == 118].shape) # rows where age is 118, the placeholder for missing age
print(profile[(profile['age'] == 118) & (profile['gender'].isna())].shape) # rows where both are missing
   age  became_member_on gender                                id    income
0  118          20170212   None  68be06ca386d4c31939f3a4f0e3dd783       NaN
1   55          20170715      F  0610b486422d4921ae7d2bf64640c50b  112000.0
2  118          20180712   None  38fe809add3b4fcf9315a9694bb96ff5       NaN
3   75          20170509      F  78afa995795e4d85b5d9ceeca43f5fef  100000.0
4  118          20170804   None  a03223e636434f42ac4c3df47e8bac43       NaN
(2175, 5)
(2175, 5)
(2175, 5)

1. clean the customer dataframe

In [38]:
profile = pd.read_json('data/profile.json', orient='records', lines=True)
In [39]:
# gender and income are null in the same records (with age set to the 118 placeholder), so drop them

df_customers = profile.dropna().copy()  # .copy() avoids SettingWithCopyWarning when adding columns later
print(df_customers.shape)

assert profile.shape[0] - df_customers.shape[0] == 2175
(14825, 5)
In [40]:
# process membership duration
# add new column called 'membership_since' to keep track of membership

til = '20191231'
d2 = datetime.strptime(til, '%Y%m%d')

# lambda function to process datetime
dur = lambda v: abs((d2 - datetime.strptime(str(v), '%Y%m%d')).days)

df_customers['membership_since'] = df_customers['became_member_on'].apply(dur)

df_customers.head()
Out[40]:
age became_member_on gender id income membership_since
1 55 20170715 F 0610b486422d4921ae7d2bf64640c50b 112000.0 899
3 75 20170509 F 78afa995795e4d85b5d9ceeca43f5fef 100000.0 966
5 68 20180426 M e2127556f4f64592b11af22de27a7932 70000.0 614
8 65 20180209 M 389bc3fa690240e798340f5a15918d5c 53000.0 690
12 58 20171111 M 2eeac8d8feae4a8cad5a6af0499a211d 51000.0 780

2. process transaction data - value column in transaction dataframe

In [60]:
# merge transcript with clean customer dataframe

df_trans = pd.merge(transcript, df_customers, left_on='person', right_on='id')

del df_trans['id']
In [61]:
# rename 'offer id' to 'offer_id' in value column 

def proc_col_value(text):
    '''
    INPUT
    text - the dict stored in the 'value' column
    
    OUTPUT
    the same dict with the key 'offer id' renamed to 'offer_id';
    dicts without that key are returned unchanged
    
    '''
    return {('offer_id' if k == 'offer id' else k): v for k, v in text.items()}

# apply the function to each cell and expand the value dict into 3 columns (offer_id, amount, reward)
df_tnx = pd.concat([df_trans, df_trans['value'].apply(proc_col_value).apply(pd.Series)], axis=1)

print(df_tnx.shape)

df_tnx.head()
(272762, 12)
Out[61]:
event person time value age became_member_on gender income membership_since offer_id amount reward
0 offer received 78afa995795e4d85b5d9ceeca43f5fef 0 {'offer id': '9b98b8c7a33c4b65b9aebfe6a799e6d9'} 75 20170509 F 100000.0 966 9b98b8c7a33c4b65b9aebfe6a799e6d9 NaN NaN
1 offer viewed 78afa995795e4d85b5d9ceeca43f5fef 6 {'offer id': '9b98b8c7a33c4b65b9aebfe6a799e6d9'} 75 20170509 F 100000.0 966 9b98b8c7a33c4b65b9aebfe6a799e6d9 NaN NaN
2 transaction 78afa995795e4d85b5d9ceeca43f5fef 132 {'amount': 19.89} 75 20170509 F 100000.0 966 NaN 19.89 NaN
3 offer completed 78afa995795e4d85b5d9ceeca43f5fef 132 {'offer_id': '9b98b8c7a33c4b65b9aebfe6a799e6d9... 75 20170509 F 100000.0 966 9b98b8c7a33c4b65b9aebfe6a799e6d9 NaN 5.0
4 transaction 78afa995795e4d85b5d9ceeca43f5fef 144 {'amount': 17.78} 75 20170509 F 100000.0 966 NaN 17.78 NaN

3. merge offer information to new transaction dataframe

In [63]:
# merge the new transaction dataframe with portfolio based on offer id

df_x = pd.merge(df_tnx, portfolio, left_on='offer_id', right_on='id', how='left')

del df_x['id']
#del df_['value']

print(df_x.shape)
df_x.head()
(272762, 17)
Out[63]:
event person time value age became_member_on gender income membership_since offer_id amount reward_x channels difficulty duration offer_type reward_y
0 offer received 78afa995795e4d85b5d9ceeca43f5fef 0 {'offer id': '9b98b8c7a33c4b65b9aebfe6a799e6d9'} 75 20170509 F 100000.0 966 9b98b8c7a33c4b65b9aebfe6a799e6d9 NaN NaN [web, email, mobile] 5.0 7.0 bogo 5.0
1 offer viewed 78afa995795e4d85b5d9ceeca43f5fef 6 {'offer id': '9b98b8c7a33c4b65b9aebfe6a799e6d9'} 75 20170509 F 100000.0 966 9b98b8c7a33c4b65b9aebfe6a799e6d9 NaN NaN [web, email, mobile] 5.0 7.0 bogo 5.0
2 transaction 78afa995795e4d85b5d9ceeca43f5fef 132 {'amount': 19.89} 75 20170509 F 100000.0 966 NaN 19.89 NaN NaN NaN NaN NaN NaN
3 offer completed 78afa995795e4d85b5d9ceeca43f5fef 132 {'offer_id': '9b98b8c7a33c4b65b9aebfe6a799e6d9... 75 20170509 F 100000.0 966 9b98b8c7a33c4b65b9aebfe6a799e6d9 NaN 5.0 [web, email, mobile] 5.0 7.0 bogo 5.0
4 transaction 78afa995795e4d85b5d9ceeca43f5fef 144 {'amount': 17.78} 75 20170509 F 100000.0 966 NaN 17.78 NaN NaN NaN NaN NaN NaN
In [ ]:
# final transaction dataframe - df_x
In [64]:
# confirm the number of customers before and after joining the customers' profile dataframe

print(len(transcript['person'].unique()))

print(len(set(transcript['person'].unique()) & set(df_x['person'])))
17000
14825

4. aggregate information to customers data frame

In [70]:
# aggregate amount and reward information for each customer

s1 = df_x.groupby('person')['amount'].sum()
s2 = df_x.groupby('person')['reward_x'].sum()

df_ = pd.merge(s1, s2, on='person')

df_cu_tmp = pd.merge(df_customers, df_, left_on='id', right_on='person')

print(df_cu_tmp.shape)
df_cu_tmp.head()
(14825, 8)
Out[70]:
age became_member_on gender id income membership_since amount reward_x
0 55 20170715 F 0610b486422d4921ae7d2bf64640c50b 112000.0 899 77.01 5.0
1 75 20170509 F 78afa995795e4d85b5d9ceeca43f5fef 100000.0 966 159.27 20.0
2 68 20180426 M e2127556f4f64592b11af22de27a7932 70000.0 614 57.73 7.0
3 65 20180209 M 389bc3fa690240e798340f5a15918d5c 53000.0 690 36.43 19.0
4 58 20171111 M 2eeac8d8feae4a8cad5a6af0499a211d 51000.0 780 15.62 2.0
In [78]:
def proc_offers_and_extends(col, df=df_x):
    '''
    INPUT
    col - the column name to process
    df - the merged new transaction dataframe

    
    OUTPUT
    df_tmp - the new aggregates information of customers
    
    Description:
    process the column of the new transaction dataframe and aggregate information.
    
    '''
    customers = list(df['person'].unique())
    tmp = {}
    
    for customer in customers:
        # get the aggregate counts from the new transaction dataframe
        tmp[customer] = df[df.person == customer][col].value_counts().to_dict()
    
    # transpose the dataframe to the proper shape
    df_tmp = pd.DataFrame(tmp).T
    
    # reset the index to a column 
    df_tmp.reset_index(level=0, inplace=True)
    
    return df_tmp
In [79]:
df_customers.head()
Out[79]:
age became_member_on gender id income membership_since
1 55 20170715 F 0610b486422d4921ae7d2bf64640c50b 112000.0 899
3 75 20170509 F 78afa995795e4d85b5d9ceeca43f5fef 100000.0 966
5 68 20180426 M e2127556f4f64592b11af22de27a7932 70000.0 614
8 65 20180209 M 389bc3fa690240e798340f5a15918d5c 53000.0 690
12 58 20171111 M 2eeac8d8feae4a8cad5a6af0499a211d 51000.0 780
In [77]:
# find each customer's responses to the offers in terms of 'offer completed', 'offer received', 'offer viewed'

df_events = proc_offers_and_extends('event')

print(df_events.shape)
df_events.head()
(14825, 5)
Out[77]:
index offer completed offer received offer viewed transaction
0 78afa995795e4d85b5d9ceeca43f5fef 3.0 4.0 4.0 7.0
1 e2127556f4f64592b11af22de27a7932 2.0 4.0 3.0 3.0
2 389bc3fa690240e798340f5a15918d5c 5.0 6.0 6.0 3.0
3 2eeac8d8feae4a8cad5a6af0499a211d 1.0 3.0 2.0 4.0
4 aa4862eba776480b8bb9c68455b8c2e1 3.0 5.0 5.0 6.0
In [76]:
# find each customer's received offers -  'bogo', 'discount', 'informational'

df_offers = proc_offers_and_extends('offer_type')

print(df_offers.shape)
df_offers.head()
(14825, 4)
Out[76]:
index bogo discount informational
0 78afa995795e4d85b5d9ceeca43f5fef 9.0 NaN 2.0
1 e2127556f4f64592b11af22de27a7932 3.0 5.0 1.0
2 389bc3fa690240e798340f5a15918d5c 11.0 6.0 NaN
3 2eeac8d8feae4a8cad5a6af0499a211d NaN 5.0 1.0
4 aa4862eba776480b8bb9c68455b8c2e1 9.0 2.0 2.0
In [81]:
# merge the aggregate information
df_ = pd.merge(df_events, df_offers, on='index')

df_cu = pd.merge(df_cu_tmp, df_, left_on='id', right_on='index')

del df_cu['index']
df_cu.head()
Out[81]:
age became_member_on gender id income membership_since amount reward_x offer completed offer received offer viewed transaction bogo discount informational
0 55 20170715 F 0610b486422d4921ae7d2bf64640c50b 112000.0 899 77.01 5.0 1.0 2.0 NaN 3.0 2.0 NaN 1.0
1 75 20170509 F 78afa995795e4d85b5d9ceeca43f5fef 100000.0 966 159.27 20.0 3.0 4.0 4.0 7.0 9.0 NaN 2.0
2 68 20180426 M e2127556f4f64592b11af22de27a7932 70000.0 614 57.73 7.0 2.0 4.0 3.0 3.0 3.0 5.0 1.0
3 65 20180209 M 389bc3fa690240e798340f5a15918d5c 53000.0 690 36.43 19.0 5.0 6.0 6.0 3.0 11.0 6.0 NaN
4 58 20171111 M 2eeac8d8feae4a8cad5a6af0499a211d 51000.0 780 15.62 2.0 1.0 3.0 2.0 4.0 NaN 5.0 1.0
In [82]:
df_cu.head()
Out[82]:
age became_member_on gender id income membership_since amount reward_x offer completed offer received offer viewed transaction bogo discount informational
0 55 20170715 F 0610b486422d4921ae7d2bf64640c50b 112000.0 899 77.01 5.0 1.0 2.0 NaN 3.0 2.0 NaN 1.0
1 75 20170509 F 78afa995795e4d85b5d9ceeca43f5fef 100000.0 966 159.27 20.0 3.0 4.0 4.0 7.0 9.0 NaN 2.0
2 68 20180426 M e2127556f4f64592b11af22de27a7932 70000.0 614 57.73 7.0 2.0 4.0 3.0 3.0 3.0 5.0 1.0
3 65 20180209 M 389bc3fa690240e798340f5a15918d5c 53000.0 690 36.43 19.0 5.0 6.0 6.0 3.0 11.0 6.0 NaN
4 58 20171111 M 2eeac8d8feae4a8cad5a6af0499a211d 51000.0 780 15.62 2.0 1.0 3.0 2.0 4.0 NaN 5.0 1.0
In [83]:
df_cu.shape
Out[83]:
(14825, 15)
In [84]:
df_cu.isnull().sum()
Out[84]:
age                    0
became_member_on       0
gender                 0
id                     0
income                 0
membership_since       0
amount                 0
reward_x               0
offer completed     2839
offer received         5
offer viewed         150
transaction          333
bogo                1743
discount            1783
informational       5625
dtype: int64

find outliers

In [85]:
df_cu.amount.describe()
Out[85]:
count    14825.000000
mean       117.028155
std        129.965016
min          0.000000
25%         31.450000
50%         87.040000
75%        160.900000
max       1608.690000
Name: amount, dtype: float64
In [217]:
def remove_outliers(col, df):
    '''
    INPUT
    col - the column to screen for outliers
    df - the source dataframe
    
    OUTPUT
    df_ - the dataframe with outliers removed
    
    Description:
    uses Tukey's rule: drop values beyond 1.5 x IQR outside the quartiles
    
    '''

    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    IQR_amount = q3 - q1
    max_value = q3 + 1.5*IQR_amount
    min_value = q1 - 1.5*IQR_amount

    print(min_value, max_value)

    df_ = df[(df[col] <= max_value) & (df[col] >= min_value)]

    return df_
In [313]:
df_cu.head()
Out[313]:
age became_member_on gender id income membership_since amount reward_x offer completed offer received offer viewed transaction bogo discount informational
0 55 20170715 F 0610b486422d4921ae7d2bf64640c50b 112000.0 899 77.01 5.0 1.0 2.0 NaN 3.0 2.0 NaN 1.0
1 75 20170509 F 78afa995795e4d85b5d9ceeca43f5fef 100000.0 966 159.27 20.0 3.0 4.0 4.0 7.0 9.0 NaN 2.0
2 68 20180426 M e2127556f4f64592b11af22de27a7932 70000.0 614 57.73 7.0 2.0 4.0 3.0 3.0 3.0 5.0 1.0
3 65 20180209 M 389bc3fa690240e798340f5a15918d5c 53000.0 690 36.43 19.0 5.0 6.0 6.0 3.0 11.0 6.0 NaN
4 58 20171111 M 2eeac8d8feae4a8cad5a6af0499a211d 51000.0 780 15.62 2.0 1.0 3.0 2.0 4.0 NaN 5.0 1.0
In [315]:
df_cu.shape
Out[315]:
(14825, 15)

General customer statistics

1. What are the customer features, and how is each one distributed?

In [234]:
df_cu['age'].describe()
Out[234]:
count    14825.000000
mean        54.393524
std         17.383705
min         18.000000
25%         42.000000
50%         55.000000
75%         66.000000
max        101.000000
Name: age, dtype: float64
In [237]:
df_cu.query("gender == 'M'")['age'].describe()
Out[237]:
count    8484.000000
mean       52.116690
std        17.413894
min        18.000000
25%        39.000000
50%        53.000000
75%        65.000000
max       100.000000
Name: age, dtype: float64
In [238]:
df_cu.query("gender == 'F'")['age'].describe()
Out[238]:
count    6129.000000
mean       57.544950
std        16.883315
min        18.000000
25%        48.000000
50%        58.000000
75%        69.000000
max       101.000000
Name: age, dtype: float64
The average customer age is 54 (median 55), ranging from 18 to 101, with a standard deviation of 17.38.
Male customers average 52 (median 53), ranging from 18 to 100, with a standard deviation of 17.41.
Female customers average 57 (median 58), ranging from 18 to 101, with a standard deviation of 16.88.
In [588]:
# check the distribution of age based on gender


fig = px.histogram(df_cu, x="age", color="gender", marginal="rug", # can be `box`, `violin`
                         hover_data=df_cu.columns, title="Customer age distribution")
fig.show()

In [589]:
# the distribution of customers' gender


df_g = df_cu['gender'].value_counts().to_frame()
fig = px.pie(df_g, values='gender',
             names=df_g.index, hover_name=['Male', 'Female', 'O'], title="Customer gender distribution")
fig.show()

Of all customers, 57.2% are male and 41.3% are female.

1.1 the relationship between customer age and amount

In [476]:
# the relationship between customer age and amount

fig = px.scatter(df_cu, x="age", y="amount", color="gender", 
                 marginal_y="rug", marginal_x="box")
fig.show()

In [477]:
# remove outliers

df = remove_outliers('amount', df_cu)

fig = px.scatter(df, x="age", y="amount", color="gender", marginal_y="rug", marginal_x="histogram")
fig.show()
-162.725 355.07499999999993

In [590]:
fig = px.scatter(df, x="age", y="amount", trendline="ols")
fig.show()

In [224]:
results = px.get_trendline_results(fig)
results.px_fit_results.iloc[0].summary()
Out[224]:
OLS Regression Results
Dep. Variable: y R-squared: 0.020
Model: OLS Adj. R-squared: 0.020
Method: Least Squares F-statistic: 293.8
Date: Tue, 24 Mar 2020 Prob (F-statistic): 3.30e-65
Time: 14:21:54 Log-Likelihood: -83186.
No. Observations: 14353 AIC: 1.664e+05
Df Residuals: 14351 BIC: 1.664e+05
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 64.5928 2.179 29.641 0.000 60.321 68.864
x1 0.6549 0.038 17.140 0.000 0.580 0.730
Omnibus: 1223.353 Durbin-Watson: 1.990
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1566.890
Skew: 0.809 Prob(JB): 0.00
Kurtosis: 2.963 Cond. No. 187.


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The analysis above shows a weak positive relationship between customer age and purchase amount (R-squared: 0.020, p-value < 0.05).

1.2 the relationship between customer income and amount

In [478]:
# the relationship between customer income and amount, using the data with outliers removed

fig = px.scatter(df, x="income", y="amount", color="gender", 
                 marginal_y="rug", marginal_x="histogram")
fig.show()

In [592]:
fig = px.scatter(df, x="income", y="amount", trendline="ols")
fig.show()

In [162]:
results = px.get_trendline_results(fig)
results.px_fit_results.iloc[0].summary()
Out[162]:
OLS Regression Results
Dep. Variable: y R-squared: 0.099
Model: OLS Adj. R-squared: 0.099
Method: Least Squares F-statistic: 1633.
Date: Tue, 24 Mar 2020 Prob (F-statistic): 0.00
Time: 13:44:13 Log-Likelihood: -92418.
No. Observations: 14825 AIC: 1.848e+05
Df Residuals: 14823 BIC: 1.849e+05
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const -6.9582 3.231 -2.154 0.031 -13.291 -0.625
x1 0.0019 4.69e-05 40.413 0.000 0.002 0.002
Omnibus: 11727.817 Durbin-Watson: 2.009
Prob(Omnibus): 0.000 Jarque-Bera (JB): 309932.613
Skew: 3.661 Prob(JB): 0.00
Kurtosis: 24.169 Cond. No. 2.20e+05


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.2e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

The analysis above shows a positive relationship between customer income and purchase amount (R-squared: 0.099, p-value < 0.05).

2. How can customers' purchase behavior be characterized, and how do they respond to different offer types?

In [593]:
# df is the dataframe with 'amount' outliers removed

fig = px.scatter(df, x="age", y="offer completed", color="gender", marginal_y="box", marginal_x="histogram")
fig.show()

In [597]:
fig = px.scatter(df, x="age", y="offer viewed", color="gender", marginal_y="box", marginal_x="histogram")
fig.show()

In [595]:
fig = px.scatter(df, x="age", y="offer received", color="gender", marginal_y="box", marginal_x="histogram")
fig.show()

In [479]:
df.head()
Out[479]:
age became_member_on gender id income membership_since amount reward_x offer completed offer received offer viewed transaction bogo discount informational
0 55 20170715 F 0610b486422d4921ae7d2bf64640c50b 112000.0 899 77.01 5.0 1.0 2.0 NaN 3.0 2.0 NaN 1.0
1 75 20170509 F 78afa995795e4d85b5d9ceeca43f5fef 100000.0 966 159.27 20.0 3.0 4.0 4.0 7.0 9.0 NaN 2.0
2 68 20180426 M e2127556f4f64592b11af22de27a7932 70000.0 614 57.73 7.0 2.0 4.0 3.0 3.0 3.0 5.0 1.0
3 65 20180209 M 389bc3fa690240e798340f5a15918d5c 53000.0 690 36.43 19.0 5.0 6.0 6.0 3.0 11.0 6.0 NaN
4 58 20171111 M 2eeac8d8feae4a8cad5a6af0499a211d 51000.0 780 15.62 2.0 1.0 3.0 2.0 4.0 NaN 5.0 1.0

Purchase behavior of male customers

In [285]:
# create a new dataframe to analyze male customers' purchase behavior by age

df_male = df.query("gender == 'M'").groupby("age")[['amount','offer completed','transaction',
                                                'bogo','discount','informational']].mean()
df_male.reset_index(level=0, inplace=True)
In [311]:
# the statistics of 'offer completed' before removing outliers

df_male['offer completed'].describe()
Out[311]:
count    83.000000
mean      2.624124
std       0.560579
min       1.833333
25%       2.418319
50%       2.523810
75%       2.686579
max       6.000000
Name: offer completed, dtype: float64
In [312]:
# remove outliers based on 'offer completed'

df_male_ = remove_outliers('offer completed', df_male)
df_male_['offer completed'].describe()
2.0159272418510934 3.088970659799328
Out[312]:
count    77.000000
mean      2.532760
std       0.221051
min       2.046154
25%       2.421053
50%       2.519481
75%       2.653595
max       3.000000
Name: offer completed, dtype: float64
In [465]:
# plot the relationship between age and 'offer completed'

fig = px.scatter(df_male_, x='age', 
                   y="offer completed", 
                 trendline="ols",
                 hover_data=df_male_.columns, title="Average offers completed by male customers")
fig.show()

In [466]:
res_offer_cplt = px.get_trendline_results(fig)
res_offer_cplt.px_fit_results.iloc[0].summary()
Out[466]:
OLS Regression Results
Dep. Variable: y R-squared: 0.306
Model: OLS Adj. R-squared: 0.297
Method: Least Squares F-statistic: 32.62
Date: Tue, 24 Mar 2020 Prob (F-statistic): 2.20e-07
Time: 21:24:26 Log-Likelihood: -8.7825
No. Observations: 76 AIC: 21.56
Df Residuals: 74 BIC: 26.23
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 2.0858 0.085 24.475 0.000 1.916 2.256
x1 0.0081 0.001 5.712 0.000 0.005 0.011
Omnibus: 64.019 Durbin-Watson: 1.341
Prob(Omnibus): 0.000 Jarque-Bera (JB): 956.640
Skew: 2.041 Prob(JB): 1.86e-208
Kurtosis: 19.895 Cond. No. 162.


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [449]:
df_male_ = remove_outliers('bogo', df_male)
fig = px.scatter(df_male_, x='age', 
                   y="bogo", 
                 trendline="lowess",
                 hover_data=df_male_.columns, title='Average BOGO offers for male customers')
fig.show()
3.7494593408611356 5.529001367522418

Male customers' positive responses to offers tend to increase with age. For male customers under 57 there is a positive relationship with BOGO offers; above 57 the trend is not obvious.

In [467]:
fig = px.scatter(df_male, x='age', 
                   y="amount", 
                 trendline="ols",
                 hover_data=df_male.columns, title="Average purchase amount of male customers")
fig.show()

In [468]:
res_amount = px.get_trendline_results(fig)
res_amount.px_fit_results.iloc[0].summary()
Out[468]:
OLS Regression Results
Dep. Variable: y R-squared: 0.354
Model: OLS Adj. R-squared: 0.346
Method: Least Squares F-statistic: 44.32
Date: Tue, 24 Mar 2020 Prob (F-statistic): 3.07e-09
Time: 21:25:03 Log-Likelihood: -370.03
No. Observations: 83 AIC: 744.1
Df Residuals: 81 BIC: 748.9
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 50.0255 6.169 8.109 0.000 37.751 62.300
x1 0.6449 0.097 6.657 0.000 0.452 0.838
Omnibus: 78.457 Durbin-Watson: 2.086
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1050.308
Skew: 2.653 Prob(JB): 8.48e-229
Kurtosis: 19.600 Cond. No. 169.


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [289]:
df_male['transaction'].describe()
Out[289]:
count    83.000000
mean      8.718089
std       1.238181
min       5.000000
25%       7.865573
50%       8.508876
75%       9.618156
max      12.000000
Name: transaction, dtype: float64
In [473]:
fig = px.scatter(df_male, x='age', 
                   y="transaction",   
                 trendline="lowess",
                 hover_data=df_male.columns, title="Average number of transactions of male customers")
fig.show()

As male customers age, the average number of purchase transactions shows a negative trend that levels off after 60, while the average purchase amount tends to increase with age.

In [471]:
df_male_ = remove_outliers('discount', df_male)
fig = px.scatter(df_male_, x='age', 
                   y="discount", 
                 trendline="ols",
                 hover_data=df_male_.columns, title="Average discount offers for male customers")
fig.show()
3.6150766203043787 5.656067816304656

In [472]:
res_dsc = px.get_trendline_results(fig)
res_dsc.px_fit_results.iloc[0].summary()
Out[472]:
OLS Regression Results
Dep. Variable: y R-squared: 0.211
Model: OLS Adj. R-squared: 0.200
Method: Least Squares F-statistic: 19.74
Date: Tue, 24 Mar 2020 Prob (F-statistic): 3.06e-05
Time: 21:26:58 Log-Likelihood: -19.929
No. Observations: 76 AIC: 43.86
Df Residuals: 74 BIC: 48.52
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 4.2318 0.099 42.884 0.000 4.035 4.428
x1 0.0073 0.002 4.443 0.000 0.004 0.011
Omnibus: 6.473 Durbin-Watson: 1.842
Prob(Omnibus): 0.039 Jarque-Bera (JB): 5.727
Skew: -0.587 Prob(JB): 0.0571
Kurtosis: 3.655 Cond. No. 162.


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

As male customers age, their number of responses to discount offers shows a positive trend.

In [445]:
df_male_ = remove_outliers('informational', df_male)
fig = px.scatter(df_male_, x='age', 
                   y="informational", 
                 trendline="lowess",
                 hover_data=df_male_.columns, title="Average informational offers for male customers")
fig.show()
1.86841312857573 2.976513098464318

Purchase behavior of female customers

In [298]:
df_female = df.query("gender == 'F'").groupby("age")[['amount','offer completed','transaction',
                                                'bogo','discount','informational']].mean()
df_female.reset_index(level=0, inplace=True)
In [299]:
df_female['offer completed'].describe()
Out[299]:
count    84.000000
mean      2.814252
std       0.180662
min       2.000000
25%       2.732839
50%       2.827381
75%       2.938224
max       3.227273
Name: offer completed, dtype: float64
In [474]:
fig = px.scatter(df_female, x='age', 
                   y="offer completed", 
                 trendline="ols",
                 hover_data=df_female.columns, title="Average offers completed by female customers")
fig.show()

In [475]:
res_oc = px.get_trendline_results(fig)
res_oc.px_fit_results.iloc[0].summary()
Out[475]:
OLS Regression Results
Dep. Variable: y R-squared: 0.005
Model: OLS Adj. R-squared: -0.007
Method: Least Squares F-statistic: 0.4377
Date: Tue, 24 Mar 2020 Prob (F-statistic): 0.510
Time: 21:30:04 Log-Likelihood: 25.271
No. Observations: 84 AIC: -46.54
Df Residuals: 82 BIC: -41.68
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 2.8464 0.052 54.308 0.000 2.742 2.951
x1 -0.0005 0.001 -0.662 0.510 -0.002 0.001
Omnibus: 25.476 Durbin-Watson: 1.921
Prob(Omnibus): 0.000 Jarque-Bera (JB): 56.556
Skew: -1.048 Prob(JB): 5.24e-13
Kurtosis: 6.431 Cond. No. 170.


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [444]:
df_female_ = remove_outliers('bogo', df_female)
fig = px.scatter(df_female_, x='age', 
                   y="bogo", 
                 trendline="lowess",
                 hover_data=df_female_.columns, title='Average BOGO offers for female customers')
fig.show()
4.056657766186454 6.0789710630899165

Female customers' positive responses to offers stay roughly flat across age, with no clear upward or downward trend (p-value = 0.51 > 0.05, so the null hypothesis cannot be rejected).

In [451]:
df_female_ = remove_outliers('transaction', df_female)

fig = px.scatter(df_female_, x='age', 
                   y="transaction", 
                 
                 trendline="lowess",
                 hover_data=df_female_.columns, title="Average number of transactions of female customers")
fig.show()
3.6195686271021588 13.487389258521315

In [452]:
fig = px.scatter(df_female_, x='age', 
                   y="amount", 
                 trendline="lowess",
                 hover_data=df_female_.columns, title="Average purchase amount of female customers")
fig.show()

Female customers show a trend similar to male customers: the average number of purchase transactions is negatively related to age, flattening after 60. Average purchase amount also increases with age, but trends downward after 70.

In [453]:
df_female_ = remove_outliers('discount', df_female)
fig = px.scatter(df_female_, x='age', 
                   y="discount", 
                 trendline="lowess",
                 hover_data=df_female_.columns, title="Average discount offers for female customers")
fig.show()
3.742277397260276 5.803966894977167

In [454]:
df_female_ = remove_outliers('informational', df_female)
fig = px.scatter(df_female_, x='age', 
                   y="informational", 
                 trendline="lowess",
                 hover_data=df_female_.columns, title="Average informational offers for female customers")
fig.show()
1.9410911339421615 2.97238235667174

For female customers under 60, the average response to discount offers increases with age but declines after 60. For female customers under 48, the average response to informational offers increases with age; after 48 the trend is not obvious.

The effect of income on purchase behavior

In [552]:
# Group 1 -> income less than 50k, all customers

df_income1 = df.query("income < 50000 ")[['amount',
                                            'offer completed','transaction',
                                            'bogo','discount','informational']]
df_income1.shape
Out[552]:
(3716, 6)
In [553]:
# Group 1_1 -> income less than 50k and male customer

df_income1_1 = df.query("(income < 50000) and (gender == 'M') ")[['amount',
                                                'offer completed','transaction',
                                                'bogo','discount','informational']]
df_income1_1.shape
Out[553]:
(2614, 6)
In [546]:
# Group 1_2 -> income less than 50k, female customers

df_income1_2 = df.query("(income < 50000) and (gender == 'F') ")[['amount',
                                                'offer completed','transaction',
                                                'bogo','discount','informational']]
df_income1_2.shape
Out[546]:
(1053, 6)
In [547]:
# comparison test between male and female customers whose incomes are below 50k

t_test1 = t_test(df_income1_1, df_income1_2)
t_test1
Out[547]:
([12.862324890736579,
  9.079212743087433,
  3.802670315369743,
  4.664089397829167,
  3.833394268111915,
  0.3346341121861026],
 [0.0,
  0.0,
  7.273778606997627e-05,
  1.6052411035971659e-06,
  6.425441343871352e-05,
  0.3689601289928548])

The comparison above shows that, among customers with incomes below 50k, male and female customers differ significantly in purchase amount, offers completed, and number of transactions, so the null hypothesis can be rejected.

In [526]:
def t_test(df1, df2):
    '''
    INPUT
    df1 - the first dataframe, sample size by feature size
    df2 - the second dataframe, sample size by feature size
    
    OUTPUT
    t_score - the list of t-score of each feature
    p_value - the list of p-value of each feature
    
    Description:
    The two dataframes must have the same number of columns so their features are comparable.
    
    '''
    # check if comparable - the number of columns
    assert df1.shape[1] == df2.shape[1]
    
    # get the number of features
    fea = df1.shape[1]
    
    # get the sample size - the number of rows 
    n1 = df1.shape[0]
    n2 = df2.shape[0]
    
    
    # get sample variance and sample mean
    var1 = []
    var2 = []
    mean1 = []
    mean2 = []
    for idx in range(fea):
        var1.append(df1.iloc[:,idx].var())
        var2.append(df2.iloc[:,idx].var())
        mean1.append(df1.iloc[:,idx].mean())
        mean2.append(df2.iloc[:,idx].mean())
    
    # get t-score and p-value
    t_score = []
    p_value = []
    for i in range(len(var1)):
        t = (mean2[i]-mean1[i])/np.sqrt(var1[i]/n1 + var2[i]/n2)
        t_score.append(t)
        
        # degrees of freedom
        df = n1 + n2 - 2

        # p-value after comparison with the t 
        p = 1 - stats.t.cdf(t, df=df)
        p_value.append(p)
    
    return t_score, p_value
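As a sanity check, the hand-rolled statistic above can be compared against SciPy's Welch t-test (`ttest_ind` with `equal_var=False`); the t-scores should agree exactly. This is a sketch on synthetic arrays, not the notebook's data. Note that SciPy's p-value is two-sided with Welch–Satterthwaite degrees of freedom, while `t_test()` reports a one-sided p-value with pooled df, so only the t-scores match term for term.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(loc=1.0, scale=2.0, size=500)   # sample 1
b = rng.normal(loc=1.3, scale=2.5, size=300)   # sample 2

# hand-rolled Welch t-score, matching the formula inside t_test()
# (ddof=1 matches the sample variance pandas' .var() uses)
t_manual = (b.mean() - a.mean()) / np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))

# SciPy's Welch test: same t-score, two-sided p-value
t_scipy, p_two_sided = stats.ttest_ind(b, a, equal_var=False)

print(t_manual, t_scipy)  # the two t-scores agree
```

For a positive t-score, the notebook's one-sided p-value is simply `p_two_sided / 2`.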
In [554]:
# Group 2 -> income between 50k and 75k, all customers

df_income2  = df.query("income >= 50000 and income <= 75000")[['amount','offer completed','transaction',
                                                'bogo','discount','informational']]
df_income2.shape
Out[554]:
(6496, 6)
In [555]:
# Group 2_1 -> income between 50k and 75k, male customers

df_income2_1  = df.query("income >= 50000 and income <= 75000 and gender =='M'")[['amount',
                                                'offer completed','transaction',
                                                'bogo','discount','informational']]
df_income2_1.shape
Out[555]:
(3958, 6)
In [556]:
# Group 2_2 -> income between 50k and 75k, female customers

df_income2_2  = df.query("income >= 50000 and income <= 75000 and gender =='F'")[['amount',
                                                'offer completed','transaction',
                                                'bogo','discount','informational']]
df_income2_2.shape
Out[556]:
(2443, 6)
In [557]:
# comparison test between the male and female customers whose income is in the range of 50k to 75k

t_test2 = t_test(df_income2_1, df_income2_2)
t_test2
Out[557]:
([15.887242936017701,
  10.247036110092921,
  -1.1481205292152532,
  6.589895673943052,
  3.9432202518276327,
  3.4155947045898505],
 [0.0,
  0.0,
  0.8745191355181453,
  2.37609931730276e-11,
  4.06264846949389e-05,
  0.00032019320969445264])

The comparison above shows that male and female customers earning between 50k and 75k differ significantly in spend amount and completed offers, so the null hypothesis can be rejected there; transaction count shows no significant difference (p ≈ 0.87), so the null cannot be rejected for it.

In [558]:
# Group 3 -> income greater than 75k, all customers

df_income3  = df.query("income > 75000 ")[['amount','offer completed','transaction',
                                                'bogo','discount','informational']]
df_income3.shape
Out[558]:
(4141, 6)
In [559]:
# Group 3_1 -> income greater than 75k, male customers

df_income3_1  = df.query("income > 75000 and gender == 'M'")[['amount','offer completed','transaction',
                                                'bogo','discount','informational']]
df_income3_1.shape
Out[559]:
(1691, 6)
In [560]:
# Group 3_2 -> income greater than 75k, female customers

df_income3_2  = df.query("income > 75000 and gender == 'F'")[['amount','offer completed','transaction',
                                                'bogo','discount','informational']]
df_income3_2.shape
Out[560]:
(2387, 6)
In [561]:
# comparison test between the male and female customers whose income is greater than 75k

t_test3 = t_test(df_income3_1, df_income3_2)
t_test3
Out[561]:
([1.2018197515985514,
  0.05972367575825823,
  -1.079813734476293,
  0.3361439166710856,
  -3.0443744129210843,
  -3.5710106019262096],
 [0.11475159579433603,
  0.4761893213628018,
  0.8598554881572607,
  0.3683898225888752,
  0.9988267324555211,
  0.9998201441358382])

The comparison above shows that male and female customers earning more than 75k show no significant one-sided difference in spend amount, completed offers, or transaction count, so the null hypothesis cannot be rejected. (Note, however, that the discount and informational t-scores are strongly negative; the one-sided p-values near 1 mean only that the difference points the other way, and a two-sided test would flag both as significant.)

In [540]:
# comparison test between the customers whose income is less than 50k and the customers
#    whose income is in the range of 50k to 75k

t_test4 = t_test(df_income1, df_income2)
t_test4
Out[540]:
([30.565307892830905,
  16.550252597677126,
  -7.703431341851828,
  10.64034252167871,
  10.476163720325173,
  6.813025081569449],
 [0.0, 0.0, 0.9999999999999927, 0.0, 0.0, 5.0812687391044165e-12])

The comparison above shows that customers earning less than 50k and those earning between 50k and 75k differ significantly in spend amount and completed offers, so the null hypothesis can be rejected there. The one-sided p-value for transaction count is ≈ 1, so that test does not reject the null; note, though, that its t-score is about -7.7, so the difference is large but in the opposite direction.

In [541]:
# comparison test between the customers whose income is less than 50k and the customers
#    whose income is greater than 75k

t_test5 = t_test(df_income1, df_income3)
t_test5
Out[541]:
([56.54534915931741,
  22.72365924333532,
  -29.197956345263357,
  13.937736387322582,
  14.473183900108877,
  3.9280281635982828],
 [0.0, 0.0, 1.0, 0.0, 0.0, 4.324937616928537e-05])

The comparison above shows that customers earning less than 50k and those earning more than 75k differ significantly in spend amount and completed offers, so the null hypothesis can be rejected there. The one-sided p-value for transaction count is 1, so that test does not reject the null, although the t-score of about -29.2 indicates a large difference in the opposite direction.

In [542]:
# comparison test between the customers whose income is greater than 75k and the customers
#    whose income is in the range of 50k to 75k

t_test6 = t_test(df_income3, df_income2)
t_test6
Out[542]:
([-30.582812808222258,
  -8.766731060908677,
  32.04019562417592,
  -4.959985039246813,
  -5.78977548055488,
  2.936824638743272],
 [1.0, 1.0, 0.0, 0.999999642056926, 0.9999999963762154, 0.0016615119984172422])

The comparison above shows that, for customers earning between 50k and 75k versus those earning more than 75k, the one-sided test does not reject the null for spend amount or completed offers (their t-scores are strongly negative, so the difference points the other way), while transaction count is significant and the null can be rejected for it.

In [565]:
# get and check train data

df_train = df_cu[['age', 'gender', 'income', 'membership_since', 'amount',
                  'offer completed', 'offer received', 'offer viewed']]

df_train.head(), df_train.shape, df_train.isnull().sum()
Out[565]:
(   age gender    income  membership_since  amount  offer completed  \
 0   55      F  112000.0               899   77.01              1.0   
 1   75      F  100000.0               966  159.27              3.0   
 2   68      M   70000.0               614   57.73              2.0   
 3   65      M   53000.0               690   36.43              5.0   
 4   58      M   51000.0               780   15.62              1.0   
 
    offer received  offer viewed  
 0             2.0           NaN  
 1             4.0           4.0  
 2             4.0           3.0  
 3             6.0           6.0  
 4             3.0           2.0  , (14825, 8), age                    0
 gender                 0
 income                 0
 membership_since       0
 amount                 0
 offer completed     2839
 offer received         5
 offer viewed         150
 dtype: int64)
In [566]:
# fill NaN with 0 (assign back rather than using inplace on a column
#   selection, which can trigger pandas' SettingWithCopyWarning)

df_train = df_train.fillna(0)
In [567]:
# preprocess the gender field

df_train = pd.concat([df_train, pd.get_dummies(df_train['gender'])], axis=1)
del df_train['gender']
In [568]:
# remove outliers

df_train = remove_outliers('amount', df_train)
df_train.shape
-162.725 355.07499999999993
Out[568]:
(14353, 10)
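`remove_outliers` is defined earlier in the notebook; the printed bounds (-162.725, 355.075) are consistent with the standard 1.5×IQR fences on `amount`. A minimal sketch under that assumption (the toy `amount` column is illustrative, not the notebook's data):

```python
import pandas as pd

def remove_outliers(col, df):
    """Drop rows whose `col` value falls outside the 1.5*IQR fences."""
    q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    print(lower, upper)
    return df[(df[col] >= lower) & (df[col] <= upper)]

# toy usage: the 1000 row is an obvious outlier and gets dropped
toy = pd.DataFrame({'amount': [1, 2, 3, 4, 5, 1000]})
trimmed = remove_outliers('amount', toy)   # prints -1.5 8.5
```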
In [569]:
df_train.head()
Out[569]:
age income membership_since amount offer completed offer received offer viewed F M O
0 55 112000.0 899 77.01 1.0 2.0 0.0 1 0 0
1 75 100000.0 966 159.27 3.0 4.0 4.0 1 0 0
2 68 70000.0 614 57.73 2.0 4.0 3.0 0 1 0
3 65 53000.0 690 36.43 5.0 6.0 6.0 0 1 0
4 58 51000.0 780 15.62 1.0 3.0 2.0 0 1 0
In [576]:
# preprocess the train data by standardization

X = df_train[['age', 'income', 'membership_since', 'F', 'M', 'O', 
              'offer completed', 'offer received', 'offer viewed']].values
Y = df_train['amount'].values

# standardize the first 3 columns
scaled_X = {}
for each in range(3):
    mean, std = X[:,each].mean(), X[:,each].std()
    scaled_X[each] = [mean, std]
    X[:, each] = (X[:, each] - mean)/std

y = (Y - Y.mean())/Y.std()
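Because the target is standardized, model predictions come back in z-score units; to report them in dollars they must be mapped back with the saved mean and standard deviation (the same reason `scaled_X` stores the per-column statistics). A small sketch with toy numbers:

```python
import numpy as np

Y = np.array([10.0, 20.0, 30.0, 40.0])   # raw target, e.g. spend in dollars
mu, sigma = Y.mean(), Y.std()

y = (Y - mu) / sigma                      # standardized target used for training
y_pred_scaled = y.copy()                  # stand-in for model predictions

# invert the standardization to recover dollar-scale predictions
y_pred_dollars = y_pred_scaled * sigma + mu
print(y_pred_dollars)                     # -> [10. 20. 30. 40.]
```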
In [577]:
# import libs

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import train_test_split, GridSearchCV, KFold, cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
from sklearn.metrics import r2_score

# train test data split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=1)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
Out[577]:
((11482, 9), (2871, 9), (11482,), (2871,))
In [578]:
# build random forest regressor model pipeline

pipeline = Pipeline([
    #('scalar', StandardScaler()),
    ('regr', RandomForestRegressor())
])

parameters = {
    'regr__n_estimators': (10, 50, 100, 1000),
    'regr__max_depth': range(3,7),
    
}

model = GridSearchCV(pipeline, param_grid=parameters)

model.fit(X_train, y_train)
print(model.best_estimator_)

y_pred = model.predict(X_test)
Pipeline(memory=None,
         steps=[('regr',
                 RandomForestRegressor(bootstrap=True, criterion='mse',
                                       max_depth=6, max_features='auto',
                                       max_leaf_nodes=None,
                                       min_impurity_decrease=0.0,
                                       min_impurity_split=None,
                                       min_samples_leaf=1, min_samples_split=2,
                                       min_weight_fraction_leaf=0.0,
                                       n_estimators=100, n_jobs=None,
                                       oob_score=False, random_state=None,
                                       verbose=0, warm_start=False))],
         verbose=False)
In [579]:
# find r square score

print(r2_score(y_test, y_pred))
print("MSE:", metrics.mean_squared_error(y_test, y_pred))
0.6110497729859442
MSE: 0.3753102988860114
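To see which inputs drive the random forest's R², the fitted regressor inside the pipeline exposes `feature_importances_`; for the grid-searched model above the same attribute would be reached via `model.best_estimator_.named_steps['regr']`. A self-contained sketch on synthetic data:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=200)   # only feature 0 matters

pipe = Pipeline([('regr', RandomForestRegressor(n_estimators=50, random_state=0))])
pipe.fit(X, y)

# per-feature importances, normalized to sum to 1
importances = pipe.named_steps['regr'].feature_importances_
print(importances)   # feature 0 should dominate
```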
In [584]:
# build alternative neural network model using Keras

from keras.models import Sequential
from keras.layers import Dense

# find input size
input_size = X.shape[1]

# define base model
model = Sequential()
model.add(Dense(64, input_dim=input_size, kernel_initializer='normal', activation='relu'))
model.add(Dense(16, kernel_initializer='normal', activation='relu'))
model.add(Dense(16, kernel_initializer='normal', activation='relu'))
model.add(Dense(1, kernel_initializer='normal'))

# Compile model
model.compile(loss='mean_squared_error', optimizer='adam')
In [ ]:
# train the model

kfold = KFold(n_splits=10)
batch_size = 64
epochs = 80
cnt = 0
for train_index, test_index in kfold.split(X_train):
    cnt += 1
    print("Fold {}".format(cnt))

    X_tr, X_te = X_train[train_index], X_train[test_index]
    y_tr, y_te = y_train[train_index], y_train[test_index]

    # note: the same model instance is reused across folds, so weights carry
    # over from fold to fold; this is incremental training over overlapping
    # splits rather than independent cross-validation
    r = model.fit(X_tr, y_tr,
                  batch_size=batch_size,
                  epochs=epochs,
                  shuffle=True,
                  verbose=1,
                  validation_data=(X_te, y_te))
In [586]:
model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_39 (Dense)             (None, 64)                640       
_________________________________________________________________
dense_40 (Dense)             (None, 16)                1040      
_________________________________________________________________
dense_41 (Dense)             (None, 16)                272       
_________________________________________________________________
dense_42 (Dense)             (None, 1)                 17        
=================================================================
Total params: 1,969
Trainable params: 1,969
Non-trainable params: 0
_________________________________________________________________
In [587]:
y_pred=model.predict(X_test)

print(r2_score(y_test, y_pred))
print("MSE:", metrics.mean_squared_error(y_test, y_pred))
0.5859888786163514
MSE: 0.3994923435358086

Part IV: Conclusions and Summary

1. Customer findings.

  • Male customers' average completed-offer count trends upward with age. For male customers under 57, response to BOGO offers rises with age; above 57 the trend is no longer apparent.

  • For male customers under 60, average transaction count trends downward with age, while above 60 the trend fades; average spend, by contrast, rises with age.

  • Male customers' response count to discount offers trends upward with age.

  • Female customers' positive response to offers is fairly even across age, with no clear upward or downward trend.

  • Similar to male customers, female customers' average transaction count trends downward with age, flattening after 60; average spend rises with age but trends downward after 70.

  • For female customers under 60, response to discount offers rises with age but falls off above 60. For female customers under 48, response to informational offers rises with age; above 48 the trend is not apparent.

  • Male and female customers earning less than 50k differ significantly in spend amount, completed offers, and transaction count; the null hypothesis can be rejected.

  • Male and female customers earning between 50k and 75k differ significantly in spend amount and completed offers (reject the null), but not in transaction count (cannot reject the null).

  • Male and female customers earning more than 75k show no significant difference in spend amount, completed offers, or transaction count; the null hypothesis cannot be rejected.

  • Customers earning less than 50k and those earning between 50k and 75k differ significantly in spend amount and completed offers (reject the null); the one-sided test on transaction count does not reject the null.

  • Customers earning less than 50k and those earning more than 75k differ significantly in spend amount and completed offers (reject the null); the one-sided test on transaction count does not reject the null.

  • Between customers earning 50k–75k and those earning more than 75k, the one-sided test does not reject the null for spend amount or completed offers, but transaction count is significant and the null can be rejected for it.

2. Machine learning modeling.

  • Customers' age, income, gender, and membership tenure, together with aggregated transaction features, were used to predict spend amount.

  • The baseline Random Forest Regressor reaches R^2 ≈ 0.61; the alternative neural-network model reaches R^2 ≈ 0.59.